Breast Cancer Prediction Through Gene Expression Arrays

Overview

This project uses gene expression data to classify samples into three breast cancer categories: Normal, Benign, and Malignant. Working with dataset GDS3952 from the National Center for Biotechnology Information (NCBI), the study addresses three challenges typical of expression data: the "curse of dimensionality" (far more genes than samples), overfitting, and multicollinearity. The aim is to build a reliable machine learning model that can support early detection and, in turn, better patient outcomes.

Methods and Tools

The dataset contains over 50,000 genes across 125 samples, each labeled Normal, Benign, or Malignant. Preprocessing consisted of feature selection with ANOVA and Tukey tests, which reduced the feature set to 58 significant genes. Four classifiers (Decision Tree, KNN, Random Forest, and SVM) were trained and compared using both repeated and stratified group k-fold cross-validation. The best-performing model, a tuned Random Forest classifier, was then evaluated using leave-one-out cross-validation (LOOCV).
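The selection and evaluation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the data are synthetic, the class encoding, `k=10`, and `n_estimators=25` are arbitrary illustrative values, scikit-learn is assumed as the library, and the Tukey post-hoc step is omitted for brevity. The sketch places the ANOVA filter inside the cross-validation pipeline, which is one common way to keep held-out samples from influencing gene selection; the source does not state whether the project did this.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the expression matrix (NOT the GDS3952 data):
# 30 samples x 200 "genes", three classes of 10 samples each.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 10)      # assumed encoding: 0=Normal, 1=Benign, 2=Malignant
X = rng.normal(size=(30, 200))
X[:, :5] += y[:, None] * 1.5      # make the first 5 genes class-dependent

# One-way ANOVA F-test keeps the k highest-scoring genes (k=10 is illustrative).
# Because the selector is part of the pipeline, it is re-fit on each training
# fold, so the held-out sample never affects which genes are chosen.
model = make_pipeline(
    SelectKBest(f_classif, k=10),
    RandomForestClassifier(n_estimators=25, random_state=0),
)

# Leave-one-out: each sample is predicted by a model trained on the other 29.
pred = cross_val_predict(model, X, y, cv=LeaveOneOut())
accuracy = float((pred == y).mean())
print(f"LOOCV accuracy on synthetic data: {accuracy:.2f}")
```

With only 125 samples, LOOCV uses nearly all of the data for training in every fold, at the cost of fitting one model per sample.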

Key Findings

The tuned Random Forest model achieved an F1-score of 0.768 and a recall of 78.9% for malignant samples. Feature selection found significant expression differences in 3441 genes, which were then narrowed to the 58 genes used for classification. The model performed best on normal samples, with a recall of 80.6%. Performance on benign and malignant samples was more moderate, likely because these two classes resemble each other more closely than either resembles the normal class.
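For readers less familiar with these metrics, the sketch below shows how per-class recall and an F1-score are computed from true and predicted labels. The labels are a made-up toy example, not the project's predictions, and macro averaging is an assumption: the source does not say which F1 averaging produced the 0.768 figure. scikit-learn is assumed as the library.

```python
from sklearn.metrics import f1_score, recall_score

labels = ["Normal", "Benign", "Malignant"]

# Toy labels for illustration only (4 samples per class).
y_true = ["Normal"] * 4 + ["Benign"] * 4 + ["Malignant"] * 4
y_pred = [
    "Normal", "Normal", "Normal", "Benign",            # 3 of 4 Normal recovered
    "Benign", "Benign", "Malignant", "Malignant",      # 2 of 4 Benign recovered
    "Malignant", "Malignant", "Malignant", "Benign",   # 3 of 4 Malignant recovered
]

# Recall per class: correctly predicted samples / all true samples of that class.
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)

# Macro F1: unweighted mean of the per-class F1 scores (averaging is assumed).
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")

for name, value in zip(labels, per_class_recall):
    print(f"{name} recall: {value:.3f}")
print(f"Macro F1: {macro_f1:.3f}")
# Normal recall: 0.750
# Benign recall: 0.500
# Malignant recall: 0.750
# Macro F1: 0.675
```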